
/**************************************************************************
			*eCEO Source code*
***************************************************************************/
The software in this package is provided "as is" without further support. However, we are still interested in hearing what you use it for and how. We welcome comments, suggestions and collaboration; please contact:
Zhengkui Wang (wangzhengkui@nus.edu.sg)

The README file covers the following topics:
1. Prerequisites
2. Input Format
3. How to compile the source code
4. How to run the software in a Hadoop cluster
5. Example
6. How to run on the cloud

===============
1. Prerequisites
===============

Hadoop version: 0.20.2

===============
2. Input Format
===============
The user should create a space-delimited file containing the case-control genotype data as the input to the program.
Each line corresponds to one individual. The first column is the sample ID, given as an integer.
The last column is the disease status of the individual, coded as 0 or 1.
The columns in between (from the second column to the second-to-last column) hold the genotype data, coded as 0, 1 or 2.
The following is a sample data file for 5 individuals (3 cases and 2 controls), each genotyped at 8 SNPs.

1 2 1 0 2 1 0 1 0 1 
2 0 1 2 0 2 1 2 2 1
3 1 2 0 1 2 0 1 1 1
4 0 2 1 2 1 2 0 0 0
5 1 0 1 1 2 1 2 1 0
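As a sketch of how one line of this format can be read in Java (the language of the eCEO source), the following parser illustrates the column layout; the class and method names here are hypothetical illustrations, not part of the eCEO code:

```java
import java.util.Arrays;

// Hypothetical parser for one line of the space-delimited input format:
// column 1 = integer sample ID, middle columns = genotypes (0/1/2),
// last column = disease status (0 = control, 1 = case).
public class GenotypeLineParser {
    public static class Record {
        public final int sampleId;
        public final int[] genotypes;
        public final int status;
        Record(int sampleId, int[] genotypes, int status) {
            this.sampleId = sampleId;
            this.genotypes = genotypes;
            this.status = status;
        }
    }

    public static Record parse(String line) {
        String[] f = line.trim().split("\\s+");
        int sampleId = Integer.parseInt(f[0]);
        int status = Integer.parseInt(f[f.length - 1]);
        int[] genotypes = new int[f.length - 2];
        for (int i = 1; i < f.length - 1; i++) {
            int g = Integer.parseInt(f[i]);
            if (g < 0 || g > 2) throw new IllegalArgumentException("bad genotype: " + g);
            genotypes[i - 1] = g;
        }
        return new Record(sampleId, genotypes, status);
    }

    public static void main(String[] args) {
        // First line of the sample data above: ID 1, 8 genotypes, status 1 (case).
        Record r = parse("1 2 1 0 2 1 0 1 0 1");
        System.out.println(r.sampleId + " " + Arrays.toString(r.genotypes) + " " + r.status);
    }
}
```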

***Users can modify the source code to handle other kinds of data. We will release a version of the software that supports different data formats soon.***

=================================
3. How to compile the source code
=================================

Import the source code into your Java IDE, such as Eclipse.
Modify the source code according to your requirements.
When it is ready, right-click your project and choose: Export -> Java -> JAR file -> select the export destination -> Finish.
The compiled JAR file will be generated.

================================================
4. How to run the jar file in your Hadoop cluster
================================================  

1). Create the input folder in the HDFS filesystem, e.g.:
>>hadoop fs -mkdir inputfolder/

2). Put the data file into the input folder using the following command, e.g.:
>>hadoop fs -put localdatafile inputfolder/

3). Run the program using the following command, e.g.:
>>hadoop jar jar_path/***.jar sg.edu.nus.GeneProcessor inputfolder/ preprocess_output_folder/ two-locus_analysis_output_folder/ three-locus_analysis_output_folder/ topK_retrieval_from_two-locus_output_folder/

There are five args in total in the execution command.
The 1st arg is the original data path.
The 2nd arg is the output path of the data preprocessing step.
The 3rd arg is the output path of the two-locus analysis.
The 4th arg is the output path of the three-locus analysis.
The 5th arg is the output path for retrieving the top k results from the two-locus analysis result data.
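For readers unfamiliar with two-locus analysis: it scans pairs of SNPs and tests each pair's genotype combinations against disease status. As a hedged illustration (not the actual eCEO implementation; all names here are hypothetical), the kind of counting such a test is built on is a 3x3x2 contingency table per SNP pair:

```java
// Illustrative only: builds the 3 x 3 x 2 contingency table (genotype at
// SNP a, genotype at SNP b, disease status) from which a two-locus test
// statistic such as chi-square could be computed. Not the eCEO source.
public class TwoLocusTable {
    // genotypes[i][j] = genotype (0/1/2) of individual i at SNP j;
    // status[i] = 0 (control) or 1 (case).
    public static int[][][] count(int[][] genotypes, int[] status, int snpA, int snpB) {
        int[][][] table = new int[3][3][2];
        for (int i = 0; i < genotypes.length; i++) {
            table[genotypes[i][snpA]][genotypes[i][snpB]][status[i]]++;
        }
        return table;
    }

    public static void main(String[] args) {
        int[][] g = {
            {2, 1, 0}, {0, 1, 2}, {1, 2, 0}, {0, 2, 1}, {1, 0, 1}
        };
        int[] s = {1, 1, 1, 0, 0};
        int[][][] t = count(g, s, 0, 1);
        // Number of cases carrying genotype 2 at SNP 0 and genotype 1 at SNP 1:
        System.out.println(t[2][1][1]);
    }
}
```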

====================
5. Example
====================
We have provided users a toy example.
In this folder we provide the Genesnp.jar file, a compiled JAR file for 100 SNPs. The data100.txt file contains the original data for 100 SNPs from 2000 samples. For a simple test, users can use the following commands:

The command format must be the same as the following:

$HADOOP_HOME/bin/hadoop fs -mkdir genesnp/input100
$HADOOP_HOME/bin/hadoop fs -put data100.txt genesnp/input100/

$HADOOP_HOME/bin/hadoop jar GeneSnp_greedy.jar (or GeneSnp_squarechopping.jar) sg.edu.nus.GeneProcessor genesnp/input100 genesnp/output1 genesnp/output2 genesnp/output3 genesnp/output4

The result data after preprocessing will be stored in the folder genesnp/output1/ in HDFS.
The result data after the two-locus analysis will be stored in the folder genesnp/output2/ in HDFS.
The result data after the three-locus analysis will be stored in the folder genesnp/output3/ in HDFS.
The result data after the top k retrieval will be stored in the folder genesnp/output4/ in HDFS.
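The final step keeps only the k best-scoring results. As a minimal sketch of that idea (hypothetical names, not the eCEO implementation, which runs this as a job over the two-locus output), the k largest scores can be retained with a min-heap:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: keep the k largest scores using a min-heap, so the
// smallest retained score is evicted whenever the heap grows past k.
public class TopK {
    public static List<Double> topK(double[] scores, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap
        for (double s : scores) {
            heap.offer(s);
            if (heap.size() > k) heap.poll(); // drop the current minimum
        }
        List<Double> out = new ArrayList<>(heap);
        out.sort(null);           // ascending
        Collections.reverse(out); // descending: best score first
        return out;
    }

    public static void main(String[] args) {
        System.out.println(topK(new double[]{3.2, 8.1, 5.5, 0.4, 7.7}, 3));
    }
}
```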

=================================
6. How to run on the cloud
=================================

Besides running the program on your own Hadoop cluster, users can also run it in the cloud. Users can use any Hadoop cluster provided by a cloud provider, such as Amazon Elastic Compute Cloud (EC2), Amazon Elastic MapReduce, etc.

1) Get a Hadoop cluster running in the cloud
	a) For how to use Amazon Elastic Compute Cloud, users can download the documentation from http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/.
We suggest users use the tool included in the Hadoop package to launch a cluster with Hadoop set up. The tool is under hadooppackage/src/contrib/ec2/bin, and the instructions can be found there as well. Alternatively, users can find the instructions at http://wiki.apache.org/hadoop/AmazonEC2#AutomatedScripts.
	b) For using Amazon Elastic MapReduce, users can download the documentation from http://aws.amazon.com/elasticmapreduce/
2) How to run our program

	a) Upload your JAR file and your data to the cloud.
	b) Then run the JAR file exactly as you would on your own cluster. Instructions can be found above.

We welcome all collaborations!

